Introduction

This dataset comes from Kaggle and was scraped entirely through the Goodreads API. According to the creator of the Kaggle page, the dataset is intended to give a clear picture of which books to recommend, judging by the numbers.

According to Wikipedia, Goodreads is a social cataloging website that allows individuals to freely search its database of books, annotations, and reviews. Users can sign up and register books to generate library catalogs and reading lists. They can also create their own groups of book suggestions, surveys, polls, blogs, and discussions. The website’s offices are located in San Francisco, and the company is owned by the online retailer Amazon. On July 23, 2013, Goodreads announced on its website that its user base had grown to 20 million members, having doubled in close to 11 months.

As one of the world’s most influential reading sites, Goodreads provides a platform for people interested in talking about books. This Goodreads dataset covers the books listed on the Goodreads platform. For each book it contains basic information, the rating, and the ratings and reviews counts. The dataset was updated in 2019 and is already tidy and clean.

Personally, I would like to use this dataset as a reference to make my own reading list.

Here are some questions that I want to answer by analyzing the dataset:


Analysis

The first thing I need to do is import the data and get an overall view of the whole dataset. Although the dataset is very clean, some parts still need to be adjusted, so I delete the irrelevant columns and rename the remaining ones.

(To keep the report clean, I explain each step in comments inside the code instead of in the text.)
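
The cleaning step described above can be sketched in base R roughly as follows. This is only an illustration: the toy data frame stands in for the Kaggle file, and the column names, the dropped columns and the new labels are all assumptions rather than the ones used in this report.

```r
# Toy stand-in for the Kaggle CSV; every column name here is an assumption.
# With the real file this would be something like
#   books_raw <- read.csv("books.csv", stringsAsFactors = FALSE)
# where "books.csv" is a hypothetical file name.
books_raw <- data.frame(
  bookID = 1:3,
  title = c("Book A", "Book B", "Book C"),
  authors = c("Author X", "Author Y", "Author X"),
  average_rating = c(4.1, 3.8, 4.4),
  isbn = c("0000000001", "0000000002", "0000000003"),
  num_pages = c(320, 150, 610),
  ratings_count = c(1200, 45, 9800),
  text_reviews_count = c(90, 3, 410),
  stringsAsFactors = FALSE
)

# Take a first look at the structure of the data.
str(books_raw)

# Drop the columns that the analysis does not need (hypothetical choice).
drop_cols <- c("bookID", "isbn")
books <- books_raw[, !(names(books_raw) %in% drop_cols)]

# Rename the remaining columns through an explicit old -> new mapping,
# so the result does not depend on the column order.
new_names <- c(
  title = "Title",
  authors = "Author",
  average_rating = "AverageRating",
  num_pages = "PageNumber",
  ratings_count = "RatingsCount",
  text_reviews_count = "ReviewsCount"
)
names(books) <- unname(new_names[names(books)])

head(books)
```

Using a named mapping rather than positional renaming means the code keeps working if the source columns are reordered.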


Here is the result (only 150 rows are shown):


Writer Ranking

Top 20 Productive Writers


Authors are ranked by the number of their books included in the dataset.

As we can see from the graph, the most productive writer is Agatha Christie, with 69 books included. Stephen King comes second. Both of them are among my favorite writers, but I never have enough time to finish all of their masterpieces.

Each of the top 20 writers has more than 25 books in the dataset. However, such prolific authors are the exception: more than half of the writers in this dataset have written only 1 or 2 books.
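
A ranking like this, together with the share of authors who have only one or two books, can be computed along the following lines. It is a minimal sketch on made-up rows; the `Author` column name is an assumption.

```r
# Made-up example rows; the real dataset has one row per book.
books <- data.frame(
  Title = paste("Book", 1:8),
  Author = c("A", "A", "A", "B", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Count books per author and sort from most to fewest.
author_counts <- sort(table(books$Author), decreasing = TRUE)

# Keep the most productive authors (top 2 here, top 20 in the report).
top_authors <- head(author_counts, 2)
print(top_authors)

# Share of authors with at most 2 books in the data.
share_small <- mean(author_counts <= 2)
print(share_small)
```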


Book Pages

Top 10 Longest Books


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    31.0   207.0   304.0   351.7   429.8  6576.0

Book Rating

Top 10 Highest Rating Books


Books with a low ratings count are excluded from this ranking. Only books with more than 10 ratings are considered.
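
The filter can be expressed in a couple of lines of base R. The sketch below uses invented rows and assumed column names (`AverageRating`, `RatingsCount`); only the threshold of 10 ratings comes from the text.

```r
# Invented rows for illustration only.
books <- data.frame(
  Title = c("Book A", "Book B", "Book C", "Book D"),
  AverageRating = c(5.0, 4.6, 4.2, 4.8),
  RatingsCount = c(2, 350, 1200, 11),
  stringsAsFactors = FALSE
)

# Keep only books with more than 10 ratings.
rated <- books[books$RatingsCount > 10, ]

# Sort by average rating, highest first, and keep up to 10 books.
top_rated <- head(rated[order(-rated$AverageRating), ], 10)
print(top_rated)

# Five-number summary plus mean of the ratings that remain.
summary(rated$AverageRating)
```

Without the threshold, a book with a perfect score from only two raters ("Book A" in the toy data) would top the ranking.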


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.770   3.950   3.929   4.130   5.000

Reviews

Top 10 Hot Topic Books


## .
##    Few Discussion Normal Discussion   Lots Discussion 
##              6485              3973              2884
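
A grouping like the one in the table above can be produced with `cut()`. The breakpoints below are hypothetical, since the report does not state which thresholds separate the three groups, and the review counts are invented.

```r
# Invented review counts for six books.
reviews <- c(0, 4, 25, 80, 300, 2500)

# Hypothetical thresholds: up to 20 reviews, 21 to 200, more than 200.
breaks <- c(-Inf, 20, 200, Inf)

# Assign each book to a discussion level; intervals are closed on the right.
discussion <- cut(
  reviews,
  breaks = breaks,
  labels = c("Few Discussion", "Normal Discussion", "Lots Discussion")
)

# Count how many books fall into each level.
table(discussion)
```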

Relevance

Correlation Analysis


To analyze the relationship between the independent variables and the dependent variables, I need a correlation matrix. I use the DataExplorer package to present it.


As the matrix shows, the Page Number and the Ratings Count have a positive correlation of 0.86, which suggests that books with more pages tend to receive more ratings, or that people are more likely to rate thick books. The Page Number is also related to the Average Rating, but only weakly: the correlation is 0.18, which suggests that books with more pages tend to receive slightly higher scores. The author’s reputation has very little association with the average rating (the correlation coefficients are 0.07 and -0.07), far less than the other two factors. Note that these values are correlation coefficients, not p-values, so they describe the strength of an association rather than its statistical significance, and they do not prove a causal effect.
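
To make the difference between a correlation coefficient and a significance test concrete, here is a small base R sketch. It uses `cor()` and `cor.test()` instead of DataExplorer, and the data are simulated, so the numbers it prints say nothing about the Goodreads dataset. The column names are assumptions.

```r
# Simulated data for illustration only.
set.seed(1)
n <- 200
pages <- round(runif(n, 50, 900))
books <- data.frame(
  PageNumber = pages,
  RatingsCount = round(pages * 10 + rnorm(n, sd = 800)),
  AverageRating = 3.9 + rnorm(n, sd = 0.3)
)

# Pearson correlation matrix of all numeric columns.
cor_matrix <- cor(books, method = "pearson")
print(round(cor_matrix, 2))

# A correlation coefficient measures strength and direction (-1 to 1).
# Whether it differs from zero is a separate question, answered by a test.
test <- cor.test(books$PageNumber, books$AverageRating)
print(test$estimate)  # the coefficient r
print(test$p.value)   # the p-value, which is what gets compared to 0.05
```

The coefficient and the p-value are different quantities: a weak correlation can be statistically significant in a large sample, and a strong one can fail to reach significance in a small sample.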



Conclusion


Data helps us learn more about the popularity of books and writers.

Analysing the dataset from one of the biggest reading websites gives us a clearer picture of the books and writers.

We looked at the factors associated with the average ratings of books on Goodreads. People love talking about river novels, especially highly rated ones. The number of pages, the rating scores and the number of reviews are all related to one another.


Will you choose books and writers by popularity?

As I mentioned before, I want to make my own reading list based on the data. I will try to read some books by the top 10 writers whom I did not know before. However, I still think that choosing books always depends on personal taste. How about you? Will you try to read some of these books after reading this report?

Thanks for reading and keep on reading!